First Steps with Stan

Maud goes to Vegas

Andrew B. Collier

andrew@exegetic.biz | DataWookie

Derivco Brown Bag

24 May 2018 at 12:00

Meet Maud

description   symbols    payout
no win                        0
1x mouse      🐭              1
1x cat        🐱              2
2x mouse      🐭🐭            5
2x cat        🐱🐱           10
3x mouse      🐭🐭🐭         20
3x cat        🐱🐱🐱        100

Maud’s Burning Questions

  • What is the hit rate?
  • What is the RTP?
  • What is the hit rate for each winning combination?

Maud’s Data (by Session)

sessions
# A tibble: 100 x 8
   session details               spins  hits wager payout hit_rate   rtp
     <int> <list>                <int> <int> <int>  <dbl>    <dbl> <dbl>
 1       1 <data.frame [7 × 3]>      7     2    10      3    0.286 0.3  
 2       2 <data.frame [19 × 3]>    19     7    30     29    0.368 0.967
 3       3 <data.frame [19 × 3]>    19     3    22      3    0.158 0.136
 4       4 <data.frame [26 × 3]>    26     7    30     13    0.269 0.433
 5       5 <data.frame [23 × 3]>    23     8    31     35    0.348 1.13 
 6       6 <data.frame [20 × 3]>    20     8    26     12    0.4   0.462
 7       7 <data.frame [22 × 3]>    22     5    30     20    0.227 0.667
 8       8 <data.frame [22 × 3]>    22     4    25     10    0.182 0.4  
 9       9 <data.frame [18 × 3]>    18     4    26      6    0.222 0.231
10      10 <data.frame [26 × 3]>    26     8    33     75    0.308 2.27 
# ... with 90 more rows

Maud’s Data (by Spin)

spin <- sessions %>%
  select(session, details) %>%
  unnest(details) %>%
  mutate(success = as.integer(payout > 0))
# A tibble: 1,972 x 5
   session  spin wager payout success
     <int> <int> <int>  <dbl>   <int>
 1       1     1     1      1       1
 2       1     2     1      0       0
 3       1     3     1      0       0
 4       1     4     3      0       0
 5       1     5     2      0       0
 6       1     6     1      2       1
 7       1     7     1      0       0
 8       2     1     2      0       0
 9       2     2     2      0       0
10       2     3     1      1       1
# ... with 1,962 more rows

Q1: Hit Rate

Probability distribution for Bernoulli process:

\[ P(k | \theta) = \theta^k (1 - \theta)^{1 - k} \]

where

  • \(k = 0\) indicates failure;
  • \(k = 1\) indicates success; and
  • the probability of success on any trial is \(\theta\).

Probability distribution for binomial process:

\[ P(k | n, \theta) = \binom{n}{k} \theta^k (1 - \theta)^{n - k} \]

where

  • \(k\) successes in \(n\) trials; and
  • the probability of success on any trial is \(\theta\).
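As a quick sanity check on the formula, here is a direct translation of the binomial pmf into Python (the function name and the example numbers are illustrative, not from the talk):

```python
import math

def binom_pmf(k, n, theta):
    # P(k | n, theta) = C(n, k) * theta^k * (1 - theta)^(n - k)
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

# e.g. probability of 2 hits in 7 spins with a 30% hit rate:
p = binom_pmf(2, 7, 0.3)

# The pmf sums to 1 over k = 0, ..., n.
total = sum(binom_pmf(k, 7, 0.3) for k in range(8))
```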

Multiple experiments: for session \(i\) there are \(k_i\) hits from \(n_i\) spins.

Combine probabilities for multiple experiments by multiplication.

  • Why is this potentially problematic?
  • How can we get around this?
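One likely answer to both questions, sketched below with illustrative numbers: multiplying many probabilities underflows double-precision floating point, and summing logarithms avoids this.

```python
import math

# Multiplying many small probabilities underflows double precision.
probs = [1e-5] * 100
product = 1.0
for p in probs:
    product *= p
# product is now exactly 0.0: the true value, 1e-500, is far below
# the smallest representable double (~5e-324).

# Summing logs keeps the computation in a comfortable range.
log_prob = sum(math.log(p) for p in probs)  # 100 * log(1e-5) ~ -1151.3
```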

Assuming that sessions \(i = 1, 2, 3, \ldots\) are independent:

\[ P(k | n, \theta) = \prod_i P(k_i | n_i, \theta) = L(\theta; n, k) \]

Log-likelihood for binomial process (multiple trials):

\[ LL(\theta; n, k) = \sum_i k_i \log{\theta} + (n_i - k_i) \log{(1 - \theta)} \]

# A tibble: 6 x 2
  theta log_likelihood
  <dbl>          <dbl>
1 0.31          -1225.
2 0.32          -1226.
3 0.3           -1226.
4 0.33          -1227.
5 0.290         -1228.
6 0.34          -1229.
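The table above can be reproduced by evaluating \(LL\) on a grid of \(\theta\) values and ranking. A minimal sketch with made-up (hits, spins) pairs standing in for Maud's sessions:

```python
import math

# Hypothetical session data: (hits k_i, spins n_i) pairs.
sessions = [(2, 7), (7, 19), (3, 19), (7, 26)]

def log_likelihood(theta, sessions):
    # LL(theta) = sum_i [ k_i log(theta) + (n_i - k_i) log(1 - theta) ]
    return sum(k * math.log(theta) + (n - k) * math.log(1 - theta)
               for k, n in sessions)

# Evaluate on a grid and keep the best theta.
grid = [t / 100 for t in range(10, 60)]
best = max(grid, key=lambda t: log_likelihood(t, sessions))
```

The grid maximum lands next to the analytic maximum likelihood estimate, total hits over total spins.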

Bayes’ Theorem

\[ p(\theta|y) = \frac{p(y|\theta) p(\theta)}{p(y)} \propto p(y|\theta) p(\theta) \] where

  • \(y\) are observations;
  • prior — \(p(\theta)\);
  • likelihood — \(p(y|\theta)\); and
  • posterior — \(p(\theta|y)\).

For a model with one parameter, life is simple.

But real life is not that kind: models with (very!) many parameters.

Monte Carlo (Plain Vanilla Version)

Generate independent samples \(\theta^{(m)}\) from the posterior.

  • Need to be able to sample from \(p(\theta^{(m)} | y)\) directly.

Markov Chain Monte Carlo (MCMC)

Generate a chain of dependent samples: \(\theta^{(1)}\), \(\theta^{(2)}\), \(\theta^{(3)}\), … \(\theta^{(M)}\).

  • \(\theta^{(m)}\) depends on \(\theta^{(m-1)}\).
  • Need only \(p(\theta^{(m)} | \theta^{(m-1)}, y)\).

Stan: a probabilistic programming language for statistical modelling and Bayesian inference.

RStan: the R interface to Stan.

Stan + R Workflow

  1. Choose a model.
  2. Write Stan program (.stan) giving likelihood and priors.
  3. Wrap it in an R script (.R).
  4. Stan parser converts this to C++.
  5. Compile C++.
  6. Execute compiled binary.
  7. Evaluate results. Optionally return to 2.
  8. Inference based on posterior sample.

Stan Skeleton

data {
  int<lower = 0> N;
  int<lower = 0, upper = 1> y[N];
}
parameters {
  real<lower = 0, upper = 1> theta;
}
model {
  y ~ bernoulli(theta);  
}

library(rstan)

# Use all available cores.
#
options(mc.cores = parallel::detectCores())

trials <- list(
  N       = nrow(spin),
  y       = spin$success
)

fit <- stan(
  file    = "bernoulli.stan",
  data    = trials,
  chains  = 2,                         # Number of Markov chains
  warmup  = 500,                       # Warmup iterations per chain
  iter    = 1000                       # Total iterations per chain
)

class(fit)
[1] "stanfit"
attr(,"package")
[1] "rstan"
samples <- as.data.frame(fit)

head(samples)
      theta      lp__
1 0.3326573 -1228.633
2 0.3136229 -1226.911
3 0.3276020 -1227.864
4 0.3203996 -1227.155
5 0.2942303 -1228.577
6 0.3166354 -1226.968
nrow(samples)
[1] 1000

print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: bernoulli.
2 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1000.

          mean se_mean   sd     2.5%      50%    97.5% n_eff Rhat
theta     0.31    0.00 0.01     0.29     0.31     0.33   293 1.01
lp__  -1227.41    0.04 0.78 -1229.49 -1227.12 -1226.91   430 1.00

Samples were drawn using NUTS(diag_e) at Fri Aug 24 05:10:44 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Use summary() to get information for each chain.

data {
  int<lower=0> N;
  int hits[N];
  int spins[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  hits ~ binomial(spins, theta);       // Likelihood
  theta ~ beta(2, 2);                  // Prior
}
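The beta(2, 2) prior above has its mode at 0.5. For \(\alpha > 1\) and \(\beta > 1\), the mode of a Beta(\(\alpha\), \(\beta\)) density is \((\alpha - 1)/(\alpha + \beta - 2)\); a quick check (the helper name is mine):

```python
def beta_mode(a, b):
    # Mode of Beta(a, b), valid for a > 1 and b > 1.
    return (a - 1) / (a + b - 2)

mode = beta_mode(2, 2)  # 0.5
```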

print(fit, probs = c(0.025, 0.5, 0.975))
Inference for Stan model: binomial-beta-prior.
2 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=1000.

          mean se_mean   sd     2.5%      50%    97.5% n_eff Rhat
theta     0.31    0.00 0.01     0.29     0.31     0.33   359    1
lp__  -1228.99    0.04 0.86 -1231.39 -1228.66 -1228.45   466    1

Samples were drawn using NUTS(diag_e) at Fri Aug 24 05:12:41 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Q2: RTP

data {
  int<lower=0> N;
  real rtp[N];
}
parameters {
  real<lower = 0> mu;
  real<lower = 0> sigma;
}
model {
  rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
}
generated quantities {
  // Number of simulated samples = 25 * chains * active iterations per chain
  real simulated[25];
  for (i in 1:25) {
    simulated[i] = lognormal_rng(log(mu) - sigma * sigma / 2, sigma);
  }
}
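The location parameter \(\log{\mu} - \sigma^2/2\) is chosen so that the lognormal's mean is exactly \(\mu\), since a Lognormal(\(m\), \(s\)) variable has mean \(e^{m + s^2/2}\). A simulation sketch of this, plugging in round values similar to the posterior means:

```python
import math
import random

random.seed(42)

mu, sigma = 0.80, 0.72                 # plug-in values for illustration
m = math.log(mu) - sigma**2 / 2        # lognormal location parameter

draws = [random.lognormvariate(m, sigma) for _ in range(200_000)]
sample_mean = sum(draws) / len(draws)  # should be close to mu
```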

Inference for Stan model: lognormal-rtp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
mu      0.80    0.00 0.07   0.69   0.76   0.80   0.84   0.94  1901    1
sigma   0.72    0.00 0.05   0.62   0.68   0.71   0.75   0.83  1852    1
lp__  -16.12    0.02 1.01 -18.77 -16.48 -15.81 -15.41 -15.16  1759    1

Samples were drawn using NUTS(diag_e) at Fri Aug 24 05:14:00 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

What’s the probability of breaking even?

mean(simulated$rtp > 1)
[1] 0.25117

What’s the probability of doubling your money?

mean(simulated$rtp > 2)
[1] 0.05319
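These empirical probabilities can be cross-checked analytically: if RTP is lognormal then \(P(\text{RTP} > c) = 1 - \Phi((\log{c} - m)/\sigma)\). A sketch using the posterior means as plug-in values (the simulated figures above differ slightly because they also carry the posterior uncertainty in \(\mu\) and \(\sigma\)):

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

mu, sigma = 0.80, 0.72
m = math.log(mu) - sigma**2 / 2

p_break_even = 1 - norm_cdf((math.log(1) - m) / sigma)  # roughly 0.25
p_double = 1 - norm_cdf((math.log(2) - m) / sigma)      # roughly 0.05
```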

Q3: Hit Rate per Combination

description symbols payout
0
1x mouse 🐭 1
1x cat 🐱 2
2x mouse 🐭🐭 5
2x cat 🐱🐱 10
3x mouse 🐭🐭🐭 20
3x cat 🐱🐱🐱 100
data {
  int<lower=0> N;
  real rtp[N];
}
parameters {
  real<lower=0, upper=1> theta[6];
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> mu;
  mu = theta[1] * 1 +                  // Payline 1 -> 1x
       theta[2] * 2 +                  // Payline 2 -> 2x
       theta[3] * 5 +                  // Payline 3 -> 5x
       theta[4] * 10 +                 // Payline 4 -> 10x
       theta[5] * 20 +                 // Payline 5 -> 20x
       theta[6] * 100;                 // Payline 6 -> 100x
}
model {
  rtp ~ lognormal(log(mu) - sigma * sigma / 2, sigma);
  theta[1] ~ beta(2, 2);               // Mode = 0.5
  theta[2] ~ beta(2, 4);               // Mode = 0.25
  theta[3] ~ beta(2, 5);               // Mode = 0.2
  theta[4] ~ beta(2, 8);               // Mode = 0.125
  theta[5] ~ beta(2, 10);              // Mode = 0.1
  theta[6] ~ beta(2, 16);              // Mode = 0.0625
}

                mean      se_mean           sd         2.5%       97.5%    n_eff      Rhat
theta[1] 0.140072477 1.445106e-03 0.0886783678 0.0186286589 0.352326667 3765.615 0.9993099
theta[2] 0.068064522 6.711297e-04 0.0424459690 0.0102927430 0.168731470 4000.000 1.0001725
theta[3] 0.028892776 2.894141e-04 0.0183041579 0.0033953293 0.072914250 4000.000 0.9998855
theta[4] 0.014594931 1.437237e-04 0.0090898869 0.0019018620 0.036523596 4000.000 0.9999596
theta[5] 0.007441395 7.525103e-05 0.0047592927 0.0009537422 0.019112895 4000.000 1.0000193
theta[6] 0.001517744 1.482361e-05 0.0009375275 0.0002257116 0.003729426 4000.000 1.0005894

The 1x payout is triggered on average every 7 spins.

The 100x payout is triggered on average only every 659 spins.
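These waiting times are the reciprocals of the posterior mean hit rates from the table above (the expected waiting time for an event with per-spin probability \(\theta\) is \(1/\theta\)):

```python
theta_1x = 0.140072477    # posterior mean hit rate, 1x payout
theta_100x = 0.001517744  # posterior mean hit rate, 100x payout

spins_per_1x = 1 / theta_1x      # about 7 spins
spins_per_100x = 1 / theta_100x  # about 659 spins
```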

Exit Maud!

Why Stan?

  • Extract maximum information from your data.
  • Learning curve is not as steep as you might think.

Do cool things with Stan!

Extra Stuff

Metropolis-Hastings Algorithm

  1. Randomly sample \(\theta^{(0)}\).
  2. Set \(i = 1\).
  3. Randomly sample a proposal \(\theta^{*}\) in the vicinity of \(\theta^{(i-1)}\).
  4. Sample \(u\) uniformly on \([0, 1)\).
  5. Accept or reject the proposal:

\[ \theta^{(i)} = \left\{\begin{array}{ll} \theta^{*} & \text{if } u \cdot p(\theta^{(i-1)}|y) < p(\theta^{*}|y) \\ \theta^{(i-1)} & \text{otherwise.} \end{array}\right. \]

  6. Increment \(i\).
  7. Return to 3.
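The steps above can be sketched as a random-walk Metropolis sampler for the binomial hit-rate posterior with a flat prior (the step size, iteration counts, and function names here are my own choices, not from the talk; 617 hits in 1,972 spins matches Maud's totals):

```python
import math
import random

def log_posterior(theta, k=617, n=1972):
    # Binomial log-likelihood with a flat prior on (0, 1).
    if not 0 < theta < 1:
        return float("-inf")
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def metropolis(log_post, theta0, n_samples, step=0.02):
    theta = theta0
    samples = []
    for _ in range(n_samples):
        proposal = theta + random.uniform(-step, step)
        # Accept if u * p(theta) < p(proposal), compared on the log scale.
        if math.log(random.random()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples.append(theta)
    return samples

random.seed(42)
draws = metropolis(log_posterior, 0.5, 5000)
warm = draws[500:]  # discard warmup
posterior_mean = sum(warm) / len(warm)  # close to 617 / 1972 ~ 0.313
```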

Poisson

# trials <- list(
#   N       = nrow(sessions),
#   spins   = sessions$spins
# )
# 
# fit <- stan(
#   file    = "poisson.stan",
#   data    = trials,
#   chains  = 4,
#   warmup  = 1000,
#   iter    = 2000
# )
# extract(fit)

# hist(extract(fit)$lambda)

Regression

# ggplot(sessions, aes(spins, wager)) + geom_jitter()

Regression Stan

data {
  int<lower=0> N;
  vector[N] y;                         // Total wager
  vector[N] x;                         // Number of spins (standardised)
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma); // Likelihood
  //
  alpha ~ normal(0, 10);               // Prior (Normal)
  beta ~ normal(0, 10);                // Prior (Normal)
  sigma ~ cauchy(0, 5);                // Prior (Cauchy)
}

Regression Stan R

# summary(fit)
“… the theory of inverse probability is founded upon error and must be wholly rejected.”

Sir Ronald Fisher (1925)

\[ \text{hit rate} = \theta^* = \frac{\text{total hits}}{\text{total spins}} = \frac{\sum_i k_i}{\sum_i n_i} \]

with(sessions, sum(hits) / sum(spins))
[1] 0.3128803

\[ \text{hit rate for session } i = \theta^*_i = \frac{k_i}{n_i} \]

# A tibble: 1 x 2
  session_avg session_std
        <dbl>       <dbl>
1       0.310       0.103

95% confidence interval extends from 28.9% to 33.1%.
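The interval appears to follow from the session-level summary above, mean ± 1.96 standard errors (a sketch, treating the 100 session hit rates as an i.i.d. sample; the quoted figures differ in the last digit, presumably from unrounded inputs):

```python
import math

session_avg, session_std, n_sessions = 0.310, 0.103, 100

se = session_std / math.sqrt(n_sessions)
lower = session_avg - 1.96 * se  # about 0.290
upper = session_avg + 1.96 * se  # about 0.330
```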

Maud’s Bayesian Answers

  • What is the hit rate? 31.3%
  • What is the RTP? 80.1%
  • What is the hit rate for each winning combination?
    • frequent low paying combinations (every 7 spins for 1x)
    • rare high paying combinations (every 659 spins for 100x)

ShinyStan

library(shinystan)
launch_shinystan(fit)

Diagnose -> PPcheck “Posterior predictive check”: look at the distribution of observed data versus replications. Do they look similar? “If our model is a good fit then we should be able to use it to generate data that looks a lot like the data we observed.”

Resources